MS2Tox Machine Learning Tool for Predicting the Ecotoxicity of Unidentified Chemicals in Water by Nontarget LC-HRMS

To achieve water quality objectives of the zero pollution action plan in Europe, rapid methods are needed to identify the presence of toxic substances in complex water samples. However, only a small fraction of chemicals detected with nontarget high-resolution mass spectrometry can be identified, and fewer have ecotoxicological data available. We hypothesized that ecotoxicological data could be predicted for unknown molecular features in data-rich high-resolution mass spectrometry (HRMS) spectra, thereby circumventing time-consuming steps of molecular identification and rapidly flagging molecules of potentially high toxicity in complex samples. Here, we present MS2Tox, a machine learning method, to predict the toxicity of unidentified chemicals based on high-resolution accurate mass tandem mass spectra (MS2). The MS2Tox model for fish toxicity was trained and tested on 647 lethal concentration (LC50) values from the CompTox database and validated for 219 chemicals and 420 MS2 spectra from MassBank. The root mean square error (RMSE) of MS2Tox predictions was below 0.89 log-mM, while the experimental repeatability of LC50 values in CompTox was 0.44 log-mM. MS2Tox allowed accurate prediction of fish LC50 values for 22 chemicals detected in water samples, and empirical evidence suggested the right directionality for another 68 chemicals. Moreover, by incorporating structural information, e.g., the presence of carbonyl-benzene, amide moieties, or hydroxyl groups, MS2Tox outperforms baseline models that use only the exact mass or log KOW.


Datasets
For training and testing MS2Tox, the toxicity values were downloaded from EPA CompTox Chemical Dashboard (https://comptox.epa.gov/dashboard/). The HRMS data for validation was downloaded from MassBank (massbank.eu/MassBank/). In-house measurements were used for final application testing.

Toxicity data from CompTox
As toxicity values LC50 values for three fish (fathead minnow, bluegill, rainbow trout) and water flea and EC50 values for water flea and algae were used. Toxicity values were extracted so that all the parameters for one training model were the same. In CompTox some toxicity values were presented as lower than (<) or higher than (>) and those data points were not included. The parameters of the specific endpoints used in this study are shown in Table S1. Chemicals that had several different values due to replicate measurements were averaged (median() from stats package) and experimental standard deviation (SD) was calculated additionally with R function sd() from stats package. Chemicals with SD higher than 1.5 log-mM unit were excluded. Chemicals that did not have replicate measurements are marked with "NA". Before model training, all the averaged values were calculated into molar concentration using the exact mass and the final used unit for toxicity values was log-mM. Toxicity scale for training was from -6.38 to 2.94 log-mM and the exact ranges for all training, test and, validation set are in Table S2. All used averaged toxicity values with standard deviations can be found in GitHub, in file https://github.com/kruvelab/MS2Tox/blob/main/Data/FinalModelData/final_model_toxdat a.txt. S3  For the fish model toxicity values for three fish were combined using a generalized additive model (gam() from R package mgcv) to bring toxicity values for all three fish into a common fathead minnow scale. In Table S3 a summary of the final dataset: number of chemicals, pooled experimental standard deviation, and standard deviation quantiles are given. Standard deviation for an endpoint is pooled root mean square (RMS) with the equation:  MassBank data from the three labs 13 different chromatographic conditions were used, where the single largest dataset measured under the same chromatographic conditions consisted of 691 unique chemicals. This number is lower than what we were able to use by only using molecular fingerprints that can be computed from the MS 2 data. Though chromatographic information could be very beneficial for predicting toxicity, the data sparsity needs to be overcome first. One possibility would be to use a GAM model to convert retention times from one instrument/column/lab to another. However, this is much less likely to succeed as the column and mobile phase, especially pH do not affect all chemicals to an equal extent. 1 Also, the model would only be applicable to one separation mode, e.g. if reversed chromatography retention times are used, the predictions can also be made only for reversed-phase chromatography and not for HILIC. For that reason adding retention time information to the model would limit its later application of it.
Similarly, CCS values that can be measured with ion mobility provide information about the 3D structure of the chemicals and therefore possibly about the toxic interaction of the chemical with a protein. Unfortunately, CCS values can be obtained currently only with a very small fraction of HRMS instruments used for non-target analysis. Therefore, the training of such models as well as application would be limited.
The current version of MS2Tox allows toxicity predictions for any chromatography type, with or without ion mobility, and even for measurements where no chromatographic or ion mobility separation has been used, e.g. flow injections, paper spray, or MALDI.
For HRMS data, some of the chemicals had several measurements done with different parameters and both in positive and negative modes. Also, different collision energies were used. MS 2 spectra with otherwise identical LC-HRMS conditions but different collision energies and or resolution were written into one .ms file so that each file contained information about the collision voltage used as well as all fragments and their intensities were listed under the specific collision voltage. Combining multiple collision voltages captures more fragments, allows compiling a deeper fragmentation tree, and therefore, more accurate fingerprint predictions. For spectra obtained in ramp mode, no combining was done. For this reason, MassBank id that is seen in the final datasets corresponds to the last spectra with the same LC-HRMS parameters. An example of such a .ms file is given in Figure S1. E.g. acetochlor from Eawag dataset had fourteen spectra (EA010401-EA010414) with different collision energies and/or resolutions and for this compound, one data point with id EA010414 was created for fingerprint calculations. For fingerprint calculations, the number of unique chemicals was 824 for Athens, 913 for Eawag, and 849 for LCSB. For fingerprint calculations, only chemicals with available toxicity value in CompTox were used.

Application solutions
For application, three spiked water samples (Table S4 to Table S6), containing altogether 152 chemicals, from NORMAN inter-laboratory comparison were measured in-house. Measurements were carried out on ESI-LC-HRMS consisting of Thermo Scientific Dionex Ultimate 3000 (Thermo Fischer Scientific, USA) with an RS binary pump and Thermo Scientific Q Executive Orbitrap with Kinetex 2.6 µm EVO C18 (150 x 3.0 mm) reversed-phase column from Phenomenex (Torrence, CA, USA).
For LC separation 0.1% formic acid (Merck, Darmstadt Germany) solution and acetonitrile (HPLC grade, Riedel-de-Haën, Seelze Germany) were used. The ultrapure water with resistance 18.2 MΩ cm and TOC < 5 ppb was prepared by Milli-Q IQ 7000 device from Merck (Darmstadt, Germany). The gradient started with 5% acetonitrile and was increased to 100% over 20 min, was kept at 100% acetonitrile for 5 min, and then lowered back to 5% over 0.1 min. The system was equilibrated for 5 min between runs. The column oven temperature was 40 ˚C and the mobile phase flow rate was 0.350 mL/min.
The autosampler temperature was kept at 15 ˚C. Auxiliary gas, sheath gas, and sweep gas flow rates were set to 3, 35, and 0 arbitrary units, respectively. The auxiliary gas and capillary temperature were 320 ˚C and the S-lens RF level was 50%. For MS 2 data ramp was used in collision energy.
An inclusion list was used to trigger MS 2 spectra for chemicals spiked to the samples. All together three solutions were analyzed containing compounds shown in Table S4 to Table S6. However, out of 152 unique spiked chemicals, 90 were in final calculations and all chemicals were not detected with MS 2 spectra when using [M+H] + and [M-H]as precursors.   colname -unique name given for fingerprint-based on SIRIUS absoluteIndex which is used as column name later in building the prediction model.
In addition to fingerprints, the exact mass of the compound was used as a descriptive feature.
We additionally evaluated the overlap of the training, test, and validation set in the chemical space. For this the substructure, MACCS, PubChem, and Klekota-Roth fingerprints were calculated with rcdk for all chemicals in the three sets and used as an input to the principal component analysis (PCA). The first two principal components shown in Figure S2 explained 16 % of the total variability in the fingerprints. In spite of the low explained variability, the visual analysis of the first two principal components assured a good overlap in the chemical space of the training, test, and validation set, see Figure S2. S17 Figure S2. PC1 and PC2 graph for static fish LC50 dataset. Different colors show the training, test, and validation set.

Data purification before model training
Before training near-zero variance fingerprints and highly correlated fingerprints were filtered out using caret package functions nearZeroVar() with default parameters and findCorrelation() and cor() with cutoff = 0.8. The number of remaining fingerprints varied from 191 to 243 depending on the species and endpoint and is shown in

Fingerprints calculated with SIRIUS software
For toxicity predictions for mass-spectrometric data, fingerprints were calculated from HRMS data using SIRIUS+CSI:FingerID 2 application. SIRIUS+CSI:FingerID computes the fingerprints for plausible molecular formulas from .ms files containing MS 1 and MS 2 spectra. SIRIUS+CSI:FingerID constructs a fragmentation tree where smaller fragments are related to the higher fragments in a consecutive manner. In this construction, the fragments and losses with plausible molecular formulas are organized relative to each other. Fingerprints are then predicted using these constructed fragmentation trees. In the process of creating a fragmentation tree also different plausible molecular formulas are created, within the allowed mass accuracy. However, MS2Tox does not rely on the molecular formula assignment, only on the predicted fingerprints. Therefore, in the context of MS2Tox, the molecular formula predicted by SIRIUS is a by-product. To study how different are fingerprints from the same MS2 data but with different assigned formulas, cosine similarities were calculated for each data point. The histogram in Figure S3 shows that for most of the chemicals, fingerprints were very similar with average cosine similarities over 0.6 and mostly over 0.8 even. These results prove that fingerprints calculated from the same MS2 are similar even with the wrong assigned formula and thus can give the correct predicted toxicity value.
For many HRMS spectra several (occasionally up to 11) possible chemical formulas were predicted. From 680 spectra 306 had more than one possible formula calculated by SIRIUS+CSI:FingerID. 39 spectra had more than ten or more possible molecular formulas. From 680 results correct formula was ranked among three best-ranked formulas on 674 times and the best rank 592 times, meaning that 13% of the cases incorrect formula was rank one.
For the fish static model in 8% of the cases, an incorrect formula was ranked highest. In spite of the occasional highest ranking of incorrect formulas, for final toxicity predictions, only the first rank is used. In order to find out if this decision was justified, predicted toxicities were also calculated using all formulas and the first three ranked formulas and averaged. Figure 2C in the main text shows that in all three cases the RMSE were comparable and using rank one formulas gave the most similar results to when using the correct formula. This indicates that even if the correct formula is not ranked highest, the fingerprints can still be predicted accurately. Figure S3. Histogram showing the median cosine similarities between different fingerprints calculated from the same MS2 spectra but with different assigned formulas.
In this work SIRIUS calculations were done using SIRIUS graphical user interface (GUI) version 4.9.5. For fingerprint calculations ".ms" file from each experiment was created so that file contained the compound id, parent mass, MS 2 spectra on different collision energies, and MS 1 data. For MassBank MS 1 spectra were not provided, therefore, an isotope pattern calculated from the correct molecular formula was used. For SIRIUS calculations mass accuracy of 5 ppm for Orbitrap was used and all databases available in SIRIUS+CSI:FingerID software (e.g PubChem, Bio database, NORMAN) were searched for possible structural matches. Some compounds that had a mass over 600 Da could not always be calculated due to too many possible formula matches being found. This S20 arouses from a too wide range of allowed chemical elements (H, C, N, O, P, B, Si, S, Cl, Se, Br, F, I, K, Na, As).
After SIRIUS+CSI:FingerID calculations a folder that contain all predicted molecular formulas with respective fingerprints, scores, and fragmentation trees is created. Code for gathering this data into a data table can be found in GitHub R package MS2Tox.

Performance evaluation for the fingerprint calculation
The fingerprints calculated with SIRIUS+CSI:FingerID for the correct formula were compared to the fingerprints calculated directly from the structure. 98% of the fingerprints were correctly calculated by SIRIUS+CSI:FingerID software. Here correct predictions mean that if a specific fingerprint was present in the structure its predicted probability was above 0.5 as the regression trees treat all values below 0.5 as zeros and above 0.5 as one. However, for some spectra fingerprint calculations were less accurate. In Table S8 are given an overview of the spectra with the most miscalculated fingerprints. As for fingerprint calculations, different collision energies were combined, MassBank ID corresponds to the last file in the list. The number of combined MS 2 on different collision energies is given in column "number of MS 2 spectra". The table is corresponding to fish static data and only those fingerprints are selected that are used in the final models. In Table S9 most miscalculated fingerprints can be seen for the same dataset.   For calculating weights experimental SD for each chemical was used. For compounds that had several replica measurements weight = 1/SD. For those compounds that did not have replicas, SD was taken as equal to median SD errors over the dataset and then again weight = 1/SDmedian. As the range for these weights was really wide (e.g 0.68 to 7459 for fish static) all the weight values were scaled using the logarithmic value of the weight to range 1-5. Conventional scaling strategies would press the weight of less accurate measurements to zero, this was undesirable and the scaling to 1-5 was used instead. Formula to get final weights in scale 1-5: Weight final = ((weight-weightmin)/(weightmax-weightmin))*4+1.
In the case of the three fish model, the LC50 values of bluegill and rainbow trout were calculated into fathead minnow scale using the generalized additive model (GAM). Due to this conversion to LC50 values of fathead minnow have a higher weight in the model training. For this, initial weights were calculated similarly to the previous: weights = 1/SD, but after that fathead minnow weights were multiplied with two. After that, the weights were scaled to range 1-5.
All chemicals that had HRMS spectra in MassBank were removed from the training set and kept for validation of the models. The remaining chemicals were split 80/20 in training and test set. For final model available in GitHub all three datasets were combined and all compounds were used for training the model.

Comparison of different machine learning models
Five different machine learning approaches were tested for toxicity predictions models. Three of them were different extreme gradient boost, one support vector machine, and one random forest. All other parameters, including selected fingerprints, separation of training and test set, weights and model training parameters, were identical. For training ten-fold crossvalidation with "boot" method was used from caret. The performance of all used methods is described in Table S10.

Validation with MS 2 data from MassBank
For validation of the method, fingerprints calculated from MS 1 and MS 2 spectra with SIRIUS+CSI:FingerID were used. All compounds used for validation were excluded from the training and test set. Datasets for validation are described above in section "HRMS data from MassBank". The results of the endpoint predictions for the validation set are in GitHub files (https://github.com/kruvelab/MS2Tox/blob/main/Data/Validation.zip) as well as all fingerprints calculated with SIRIUS+CSI:FingerID for MassBank data with respective SiriusScores.

Variable importance and SHAP analysis
In order to understand the information learned by the xgbDART algorithm, the importance and the contribution of each fingerprint to the prediction was evaluated with SHapley Additive exPlanations (SHAP
Predicted toxicity values were compared to measured retention time. It can be seen Figure  S13, that there is a negative correlation between retention time and LC50 value that indicates that compounds with higher retention time have lower LC50. However, the correlation is vague, meaning that the toxicity predictions by MS2Tox account for also other properties besides the hydrophobicity that is primarily correlated with the retention time. For comparison, experimental fish static LC50 were also correlated with MassBank data, see Figure  S14. Similarly to Figure S13 only a vague correlation with the retention time is observed, indicating that the mechanism of action also is more complex.